1 Introduction and set-up

This project is meant to be read as HTML, so open the HTML file in your preferred web browser. By default the code is hidden in this document, but you can show all of it by pressing the “Code” button in the top right of the document. You can also show individual code chunks by pressing the “Code” buttons placed throughout the document.

Link for Google Colab:

Link for GitHub: https://github.com/DataEconomistDK/M2-Group-Assignment

R-rules: (delete before hand-in) Data may not be overwritten, but may always be given new meaningful names: data_raw, data_tidy, etc. Always use lower-case letters. Inline code comments are only for technical notes; all other explanation should be text. We load all packages at the top. We write in a single document without any branches.

In this project we work with a dataset of 5,000 consumer reviews of a few Amazon electronic products, e.g. the Kindle. The data was collected between September 2017 and October 2018. It is a sample from Kaggle that is part of a much bigger dataset available through Datafiniti. The data can be downloaded from this link: https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products?fbclid=IwAR1o_blPfHeBPmnUzAOW7Ct24L7fhbI3OGcbfaVgaDZENhVXwaCP4godKvQ#Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products.csv

Note that there are three datasets available on Kaggle; the file used here is called “Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products”. The file is downloaded as is and imported further below.

1.1 Loading packages

First I have some personal set-up in my local R Markdown for how warnings etc. are displayed. Then I load the packages.

### Knitr options
knitr::opts_chunk$set(warning=FALSE,
                     message=FALSE,
                     fig.align="center"
                     )

options(warn=-1) # Hides all warnings, as the knitr options only work on local R-Markdown mode. 

Sys.setenv(LANG = "en")
# Packages

if (!require("pacman")) install.packages("pacman") # package for loading and checking packages :)
pacman::p_load(knitr, # For knitting to html
               rmarkdown, # For formatting the document
               tidyverse, # Standard data science toolkit (dplyr, ggplot2 et al.)
               data.table, # For reading in data etc.
               magrittr, # For advanced piping (%>% et al.)
               igraph, # For network analysis
               tidygraph, # For tidy-style graph manipulation
               ggraph, # For ggplot2 style graph plotting
               Matrix, # For some matrix functionality
               ggforce, # Awesome plotting
               kableExtra, # Formatting for tables
               car, # Recode functions
               tidytext, # Structure text within tidyverse
               topicmodels, # For topic modelling
               tm, # Text mining library
               quanteda, # For LSA (latent semantic analysis)
               uwot, # For UMAP
               dbscan, # For density based clustering
               SnowballC, # For stemming (wordStem)
               wordcloud, # For word clouds
               textstem, # For lemmatization (lemmatize_words)
               tidyr # For separate() and unite(); already attached by tidyverse
               )

# I set a seed for reproducibility
set.seed(123) # Has to be set again before every step that uses the random number generator.
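As a minimal sketch of why the seed has to be reset before each random step, the base R example below draws a sample twice with the same seed and once more without resetting it. The variable names are illustrative only and are not used elsewhere in the project.

```r
# Same seed, same draws
set.seed(123)
draw_a <- sample(1:100, 5)

set.seed(123)                # reset the seed before the next random step
draw_b <- sample(1:100, 5)

draw_c <- sample(1:100, 5)   # no reset: continues the random number stream

identical(draw_a, draw_b)    # TRUE, because both draws start from the same seed
identical(draw_a, draw_c)    # compares against the draw made without a reset
```

Only the draws that start from the same seed are guaranteed to match, which is why each stochastic step later in a document would need its own `set.seed()` call to be reproducible in isolation.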

1.2 Loading and filtering data

Now we load the data we downloaded from Kaggle. From this file we select the following variables:

  • id: An id number we assign to each review, corresponding to the row number in the raw data.

  • name: The full name of the product

  • reviews.rating: The rating of the product on a scale from 1 to 5.

  • reviews.title: The title of the review, given by the customer.

  • reviews.text: The review text written by the customer.

data_raw <- read_csv("Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products.csv") %>% 
  select(name, reviews.rating, reviews.text, reviews.title) %>% 
  mutate(id = row_number())

str(data_raw)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 5000 obs. of  5 variables:
##  $ name          : chr  "Amazon Kindle E-Reader 6\" Wifi (8th Generation, 2016)" "Amazon Kindle E-Reader 6\" Wifi (8th Generation, 2016)" "Amazon Kindle E-Reader 6\" Wifi (8th Generation, 2016)" "Amazon Kindle E-Reader 6\" Wifi (8th Generation, 2016)" ...
##  $ reviews.rating: num  3 5 4 5 5 5 5 4 5 5 ...
##  $ reviews.text  : chr  "I thought it would be as big as small paper but turn out to be just like my palm. I think it is too small to re"| __truncated__ "This kindle is light and easy to use especially at the beach!!!" "Didnt know how much i'd use a kindle so went for the lower end. im happy with it, even if its a little dark" "I am 100 happy with my purchase. I caught it on sale at a really good price. I am normally a real book person, "| __truncated__ ...
##  $ reviews.title : chr  "Too small" "Great light reader. Easy to use at the beach" "Great for the price" "A Great Buy" ...
##  $ id            : int  1 2 3 4 5 6 7 8 9 10 ...
tokens_raw <- unnest_tokens(data_raw, word, reviews.text)
str(tokens_raw)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 155258 obs. of  5 variables:
##  $ name          : chr  "Amazon Kindle E-Reader 6\" Wifi (8th Generation, 2016)" "Amazon Kindle E-Reader 6\" Wifi (8th Generation, 2016)" "Amazon Kindle E-Reader 6\" Wifi (8th Generation, 2016)" "Amazon Kindle E-Reader 6\" Wifi (8th Generation, 2016)" ...
##  $ reviews.rating: num  3 3 3 3 3 3 3 3 3 3 ...
##  $ reviews.title : chr  "Too small" "Too small" "Too small" "Too small" ...
##  $ id            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ word          : chr  "i" "thought" "it" "would" ...

Tokenizing the review text into single words yields 155,258 tokens.
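To make the one-row-per-token structure concrete, here is a rough base R approximation of what word tokenization does: lower-case the text, split it into words, and repeat the review id for every word. This is only a sketch on two made-up reviews; the real `unnest_tokens()` relies on a dedicated tokenizer and handles many more cases.

```r
# Toy stand-in for data_raw (hypothetical reviews, not taken from the dataset)
reviews <- data.frame(
  id = 1:2,
  reviews.text = c("Too small to read!", "Light and easy to use."),
  stringsAsFactors = FALSE
)

# Lower-case the text, then split on anything that is not a letter, digit or apostrophe
words <- strsplit(tolower(reviews$reviews.text), "[^a-z0-9']+")

# One row per token, repeating the review id for each of its words
tokens <- data.frame(
  id   = rep(reviews$id, lengths(words)),
  word = unlist(words),
  stringsAsFactors = FALSE
)
tokens <- tokens[tokens$word != "", ]   # guard against empty pieces

tokens
```

The other columns of the review (name, rating, title) would be repeated in the same way as `id`, which is why every row of the first review above carries the same title and rating.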

Next we remove digits and non-alphanumeric characters from the tokens, and drop the tokens that end up empty.

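tokens_clean <- tokens_raw %>%
  mutate(word = word %>% str_remove_all("[^[:alnum:]]")) %>% # remove punctuation and other non-alphanumeric characters
  mutate(word = word %>% str_remove_all("[[:digit:]]")) %>% # remove digits
  mutate(word = word %>% str_remove_all("[^a-zA-Z0-9]")) %>% # remove anything that is not an ASCII letter or digit
  filter(str_length(word) > 0) # drop tokens that are now empty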

str(tokens_clean)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 153994 obs. of  5 variables:
##  $ name          : chr  "Amazon Kindle E-Reader 6\" Wifi (8th Generation, 2016)" "Amazon Kindle E-Reader 6\" Wifi (8th Generation, 2016)" "Amazon Kindle E-Reader 6\" Wifi (8th Generation, 2016)" "Amazon Kindle E-Reader 6\" Wifi (8th Generation, 2016)" ...
##  $ reviews.rating: num  3 3 3 3 3 3 3 3 3 3 ...
##  $ reviews.title : chr  "Too small" "Too small" "Too small" "Too small" ...
##  $ id            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ word          : chr  "i" "thought" "it" "would" ...
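The effect of the cleaning patterns is easiest to see on a handful of tokens. The sketch below applies the first two patterns with base R's `gsub()` as a stand-in for `str_remove_all()`, using made-up tokens rather than rows from the dataset.

```r
# Hypothetical tokens, chosen to show what each pattern removes
word <- c("kindle", "8th", "2016", "i'd", "wi-fi", "100")

step1 <- gsub("[^[:alnum:]]", "", word)   # drop punctuation such as apostrophes and hyphens
step2 <- gsub("[[:digit:]]", "", step1)   # drop digits

cleaned <- step2[nchar(step2) > 0]        # drop tokens that became empty
cleaned
```

Two side effects are worth keeping in mind: purely numeric tokens disappear entirely, which explains the drop in the number of rows, and mixed tokens are shortened rather than removed, so "8th" becomes "th" and "i'd" becomes "id".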

2 Network analysis

In this section we use network analysis to gain insight into how the reviews are structured. We extract bigrams from each review text, then clean and prepare them so that networks can be built from them. Where we previously treated tokens as individual words, we can also create n-grams, which are consecutive sequences of n words; bigrams are n-grams of two consecutive words. They capture context and connections between words. Below we create the bigrams and display the most common ones.

bigrams <- data_raw %>%
  unnest_tokens(bigram, reviews.text, token = "ngrams", n = 2) # n is the number of words to consider in each n-gram. 

# Count the most common bigrams
bigrams %>% 
  count(bigram, sort = TRUE)

Notice the most common bigrams are: ….? These are mostly stop words, which are not very useful for the analysis. To remove them, we split each bigram into two columns, word1 and word2, and drop the bigram if either word is a stop word. The stop words are taken from the stop_words dictionary. We then make a new count to see the most common bigrams.

bigrams$bigram[1:2]
## [1] "the display" "display is"

Remember that consecutive bigrams overlap, as can be seen above: the first token is “the display” and the second is “display is”.
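The overlap follows directly from how bigrams are built: each word is paired with its successor. A minimal base R sketch on a made-up sentence:

```r
# Hypothetical sentence, already split into lower-case words
words <- c("the", "display", "is", "very", "sharp")

# Pair each word with its successor: n words give n - 1 overlapping bigrams
bigram <- paste(head(words, -1), tail(words, -1))
bigram
```

Every word except the first and the last therefore appears in two bigrams, once as the second word and once as the first.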

bigrams_separated <- bigrams %>% 
  separate(bigram,c("word1","word2"),sep = " ")

bigrams_filtered <- bigrams_separated %>% 
  filter(!word1 %in% stop_words$word) %>% 
  filter(!word2 %in% stop_words$word)

#New bigram counts
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

bigram_counts

The most common bigrams are now product names such as …, as seen above. We now combine the two columns again into a single column holding the bigram.

bigrams_united <- bigrams_filtered %>% 
  unite(bigrams, word1, word2,sep = " ")

bigrams_united
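A table of bigram counts is already an edge list: word1 is the source node, word2 the target, and n the edge weight. The sketch below shows one possible way to turn such a table into a weighted adjacency matrix in base R. The counts and the threshold of 15 are invented for illustration and are not results from this dataset.

```r
# Hypothetical bigram counts in the same shape as bigram_counts (word1, word2, n)
edges <- data.frame(
  word1 = c("kindle", "fire", "battery", "kindle"),
  word2 = c("fire", "tablet", "life", "paperwhite"),
  n     = c(40, 25, 18, 12),
  stringsAsFactors = FALSE
)

# Keep only frequent pairs so the network stays readable (threshold is an assumption)
edges <- edges[edges$n >= 15, ]

# Nodes are the distinct words; edges point from word1 to word2, weighted by n
nodes <- sort(unique(c(edges$word1, edges$word2)))
adjacency <- matrix(0, nrow = length(nodes), ncol = length(nodes),
                    dimnames = list(from = nodes, to = nodes))
adjacency[cbind(edges$word1, edges$word2)] <- edges$n

adjacency
rowSums(adjacency > 0)   # out-degree: how many distinct words follow each word
```

A data frame in this shape, with the two endpoint columns first, is also what `igraph::graph_from_data_frame()` expects, so the same filtered table could be handed to igraph and ggraph for plotting.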
# Trigrams: count three-word sequences that contain no stop words
data_raw %>% 
  unnest_tokens(trigram, reviews.text, token = "ngrams", n = 3) %>% 
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>% 
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) %>% 
  count(word1, word2, word3, sort = TRUE)

# tf-idf of the bigrams, treating each review title as a document
bigram_tf_idf <- bigrams_united %>%
  count(reviews.title, bigrams) %>%
  bind_tf_idf(bigrams, reviews.title, n) %>%
  arrange(desc(tf_idf))

bigram_tf_idf

# Lemmatization: replace each word with its dictionary form
tokens_clean$word <- lemmatize_words(tokens_clean$word)
tokens_clean %>% 
  count(word, sort = TRUE)

# Print only the first 10 unique words
options(max.print = 10)
unique(tokens_clean$word)
##  [1] "i"     "think" "it"    "would" "be"    "as"    "big"   "small"
##  [9] "paper" "but"  
##  [ reached getOption("max.print") -- omitted 4121 entries ]
# Stemming: reduce each word to its stem
tokens_stemmed <- tokens_clean %>% 
  mutate(word = wordStem(word))
unique(tokens_stemmed$word)
##  [1] "i"     "think" "it"    "would" "be"    "a"     "big"   "small"
##  [9] "paper" "but"  
##  [ reached getOption("max.print") -- omitted 3730 entries ]

3 NLP

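# Inspect the 100 most frequent words before removing stop words
tokens_clean %>% count(word, sort = TRUE) %>% head(100)

# Remove the standard stop words from tidytext
tokens_clean %<>%
  anti_join(stop_words, by = "word")

# Also remove our own stop words: contractions written without an apostrophe
own_stopwords <- tibble(word = c("im", "ive", "dont", "doesnt", "didnt"), lexicon = "OWN")
tokens_clean %<>% anti_join(stop_words %>% bind_rows(own_stopwords), by = "word")

# Keep words used more than once within a review, in reviews with more than five remaining tokens
tokens_clean %<>%
  add_count(id, word, name = "nword") %>%
  add_count(id, name = "nreview") %>%
  filter(nword > 1 & nreview > 5) %>%
  select(-nword, -nreview)

# Stem the filtered tokens
tokens_stemmed <- tokens_clean %>% mutate(word = wordStem(word))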
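The frequency filter in this step is fairly strict, so it helps to see it on a small case. The sketch below reproduces the two counts with base R's `ave()` on made-up tokens: the number of times a word occurs within a review, and the number of tokens in the review. The column names here are illustrative.

```r
# Hypothetical tokens: review 1 has six tokens, review 2 has three
tokens <- data.frame(
  id   = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  word = c("great", "great", "tablet", "screen", "great", "tablet",
           "good", "good", "price"),
  stringsAsFactors = FALSE
)

# Count rows per (id, word) pair and per id, attached to every row
tokens$nword   <- ave(seq_along(tokens$word), tokens$id, tokens$word, FUN = length)
tokens$nreview <- ave(seq_along(tokens$id), tokens$id, FUN = length)

# Keep words used more than once in a review, in reviews with more than five tokens
kept <- tokens[tokens$nword > 1 & tokens$nreview > 5, c("id", "word")]
kept
```

Here "screen" is dropped because it occurs only once in review 1, and review 2 is dropped entirely because it has only three tokens, even though "good" is repeated in it.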
topwords <- tokens_stemmed %>% count(word, sort=TRUE)
topwords %>%
  top_n(20, n) %>%
  ggplot(aes(x = word %>% fct_reorder(n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Word Counts", 
       x = "Top Words", # x is the word axis, drawn vertically after coord_flip()
       y = "Frequency")

topwords %>%
  ggplot(aes(x = n)) + 
  geom_histogram()

wordcloud(topwords$word, topwords$n, random.order = FALSE, max.words = 50, colors = brewer.pal(8,"Dark2"))

3.1 TF-IDF

  • comment on not running a tf-idf
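# Count each word per product, so that each product acts as one document
group_words <- tokens_stemmed %>% count(name, word, sort = TRUE)
# Total number of words per product
total_words <- group_words %>% group_by(name) %>% summarize(total = sum(n))
tokens_tfidf <- left_join(group_words, total_words, by = "name")
# Add the tf, idf and tf_idf columns
tokens_tfidf <- tokens_tfidf %>% bind_tf_idf(word, name, n)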

Now I remove the total column and arrange the rows by descending tf-idf.

tokens_tfidf %>% select(-total) %>% arrange(desc(tf_idf))
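For reference, the tf-idf values can be reproduced by hand. The sketch below uses the same definitions as `bind_tf_idf()`: term frequency is a word's share of all words in a document, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the word. The counts are made up, with two products standing in as documents.

```r
# Hypothetical word counts per product (document), in the shape of group_words
counts <- data.frame(
  name = c("kindle", "kindle", "tablet", "tablet"),
  word = c("read", "screen", "game", "screen"),
  n    = c(3, 1, 2, 2),
  stringsAsFactors = FALSE
)

# tf: share of a product's words taken up by this word
counts$tf <- counts$n / ave(counts$n, counts$name, FUN = sum)

# idf: natural log of (number of products / number of products containing the word)
n_docs   <- length(unique(counts$name))
doc_freq <- table(counts$word)
counts$idf <- log(n_docs / as.numeric(doc_freq[counts$word]))

counts$tf_idf <- counts$tf * counts$idf
counts[order(-counts$tf_idf), ]
```

A word that occurs in every product, like "screen" here, gets an idf of zero and therefore a tf-idf of zero regardless of how often it is used. With only a few distinct product names acting as documents, this can affect a large share of the vocabulary.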

4 Machine learning